Detection of Foreign Entities in Native Text Using N-gram Based Cumulative Frequency Addition

Authors

  • Bashir Ahmed
  • Sung-Hyuk Cha
  • Charles Tappert
Abstract

This paper describes a logarithmic version of the conventional Naïve Bayesian N-gram-based text classification algorithm, which we name Cumulative Frequency Addition (CFA), and its application to three tasks: language identification, nationality identification from names, and detection of foreign words in base text. The new CFA technique is 3-10 times faster than N-gram-based rank-order statistical classifiers. In the language identification task, CFA yields 100% accuracy on strings longer than 150 characters. In the name-to-nationality task, it yields 86% accuracy on a 14-country database and 96% on a 7-country database within the top three choices. Finally, in the task of detecting foreign words, it yields 66.9% accuracy. This is the first study to apply natural language processing techniques to such tasks as name identification and foreign word detection.
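The abstract gives only a one-line description of CFA, so the following is a minimal sketch of one plausible reading rather than the authors' implementation: each class (for example, a language) gets a profile of relative character N-gram frequencies, and an input string is scored by adding the logarithms of those frequencies, with the highest total winning. The function names, the bigram size, the `floor` value used for unseen N-grams, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=2):
    """Build one profile per label: n-gram -> relative frequency.

    samples maps a label to a list of training strings (hypothetical data).
    """
    profiles = {}
    for label, texts in samples.items():
        counts = Counter(g for t in texts for g in char_ngrams(t, n))
        total = sum(counts.values())
        profiles[label] = {g: c / total for g, c in counts.items()}
    return profiles


def classify(text, profiles, n=2, floor=1e-6):
    """Score text against each profile by summing log relative frequencies.

    floor is an assumed stand-in frequency for n-grams absent from a profile;
    the abstract does not say how unseen n-grams are handled.
    """
    grams = char_ngrams(text, n)
    scores = {
        label: sum(math.log(freqs.get(g, floor)) for g in grams)
        for label, freqs in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data, far too small for real use.
toy_samples = {
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "this is the house where they live",
    ],
    "spanish": [
        "el rápido zorro marrón salta sobre el perro perezoso",
        "esta es la casa donde ellos viven",
    ],
}

toy_profiles = train(toy_samples)
best_label, all_scores = classify("the dog is in the house", toy_profiles)
print(best_label, all_scores)
```

Under the same assumptions, the foreign-word task could be approached by calling `classify` on each word of a sentence and flagging words whose best-scoring label differs from the base language. The name-to-nationality task would rank the labels by score and keep the top three.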


Similar resources

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-...


Named Entity Recognition in Persian Text using Deep Learning

Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subset of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...


An unsupervised method for identifying loanwords in Korean

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming trace...


Capturing Out-of-Vocabulary Words in Arabic Text

The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out of vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, a...


Comparing Lexical Bundles in Hard Science Lectures; A Case of Native and Non-Native University Lecturers

Researchers have stated that non-native lecturers who learn and apply a certain set of lexical bundles used by native lecturers would help students improve their proficiency through incidental vocabulary input. The present study sheds light on the lexical bundles in hard science lectures used by native and non-native lecturers in international universities with the main purpose of analyzing the structu...




Publication date: 2005